Instacart is a grocery delivery startup that facilitates on-demand grocery deliveries to homes within an hour in major US cities. As of June 2017, the company is valued at $3.4 billion. It recently published a dataset of grocery orders that is well suited to market basket analysis. In this post, I look for interesting patterns in which products customers purchase together.

The Data

The dataset contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. You can download the dataset here (https://www.instacart.com/datasets/grocery-shopping-2017).

The Algorithm

Apriori is an algorithm that mines transaction databases for frequent itemsets, that is, groups of items that appear together in many transactions, and derives association rules from them. This can be very useful for analyzing shopping carts and finding interesting patterns. I will be using the arules package in R, which provides an implementation of apriori.
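
To make the idea of a level-wise search concrete, here is a minimal base R sketch on made-up baskets. It is not how arules implements apriori: the helper names (support_of, frequent_itemsets) and the data are assumptions for illustration, and the join step is simplified (it does not check every subset of a candidate before counting it).

```r
# Toy baskets (hypothetical data, not taken from the Instacart files).
baskets <- list(
  c("Bread", "Jelly", "Peanut Butter"),
  c("Bread", "Peanut Butter"),
  c("Bread", "Milk"),
  c("Bread", "Jelly", "Milk", "Peanut Butter"),
  c("Milk")
)

# Fraction of baskets that contain every item in `itemset`.
support_of <- function(itemset, baskets) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Level-wise search: size-(k+1) candidates are built only from frequent
# size-k itemsets, so infrequent itemsets are never extended.
frequent_itemsets <- function(baskets, min_support) {
  current <- as.list(sort(unique(unlist(baskets))))
  result <- list()
  while (length(current) > 0) {
    supports <- vapply(current, support_of, numeric(1), baskets = baskets)
    keep <- supports >= min_support
    frequent <- current[keep]
    result <- c(result, Map(function(s, v) list(items = s, support = v),
                            frequent, supports[keep]))
    # Join step: merge pairs of frequent itemsets that differ by one item.
    candidates <- list()
    n <- length(frequent)
    if (n >= 2) {
      for (i in 1:(n - 1)) for (j in (i + 1):n) {
        merged <- sort(union(frequent[[i]], frequent[[j]]))
        if (length(merged) == length(frequent[[i]]) + 1) {
          candidates[[paste(merged, collapse = "|")]] <- merged
        }
      }
    }
    current <- unname(candidates)
  }
  result
}

found <- frequent_itemsets(baskets, min_support = 0.5)
for (f in found) cat(paste(f$items, collapse = ", "), ":", f$support, "\n")
```

With a minimum support of 0.5, Jelly is dropped at the first level, so no pair containing Jelly is ever counted. That early pruning is what keeps the search tractable on tens of thousands of items.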

Using the apriori algorithm, we might find rules that indicate grocery items that are commonly bought together with other items. For instance, the rule {Peanut Butter, Jelly} => {Bread} indicates that customers who purchase Peanut Butter & Jelly are likely to purchase Bread. This information can be very useful for product recommendations, cross marketing, and promotions to increase sales.

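Three measures are key to understanding the apriori algorithm: Support, Confidence, and Lift. Support is an indication of how frequently the itemset appears in the dataset. Confidence is an indication of how often the rule has been found to be true: among the orders that contain the left-hand side, the share that also contain the right-hand side. A lift greater than 1 indicates that the two sides of a rule occur together more often than would be expected if they were independent, and its size tells us how strong that dependence is.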
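
The three measures can be worked out by hand for the {Peanut Butter, Jelly} => {Bread} rule. The sketch below uses eight made-up baskets and a hypothetical helper, contains(); none of it comes from the Instacart data.

```r
# Hypothetical baskets used only to illustrate the three measures.
baskets <- list(
  c("Peanut Butter", "Jelly", "Bread"),
  c("Peanut Butter", "Jelly", "Bread", "Milk"),
  c("Peanut Butter", "Jelly"),
  c("Bread", "Milk"),
  c("Bread"),
  c("Milk"),
  c("Eggs", "Milk"),
  c("Eggs")
)

# TRUE for each basket that contains every item in `itemset`.
contains <- function(itemset) {
  vapply(baskets, function(b) all(itemset %in% b), logical(1))
}

lhs <- c("Peanut Butter", "Jelly")
rhs <- "Bread"

support    <- mean(contains(c(lhs, rhs)))       # share of baskets with lhs and rhs
confidence <- support / mean(contains(lhs))     # share of lhs baskets that also hold rhs
lift       <- confidence / mean(contains(rhs))  # confidence relative to rhs's base rate

c(support = support, confidence = confidence, lift = lift)
```

Here the rule appears in 2 of 8 baskets (support 0.25), holds in 2 of the 3 baskets containing peanut butter and jelly (confidence of about 0.67), and bread is in half of all baskets, so the lift is about 1.33.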


Load packages

library(arules)
library(arulesViz)
library(tidyverse)
library(plotly)


Import Data

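# analyze the first 750,000 rows of the order/product file
order_products = read.csv("order_products_train.csv", nrows = 750000)
# lookup table that maps product_id to product_name
products = read.csv("products.csv")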


Data Wrangling

data =
  left_join(order_products, products, by = "product_id") %>%
  select(order_id, product_name)


Inspect Dataset

data


Convert the data frame into transactions, as required by the arules package. The transactions format represents the data as a sparse matrix with 39320 transactions (rows) and 30896 items (columns). Each cell in the matrix records 1 if the order contains the product and 0 otherwise.

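write_csv(data, 'data.csv')
# read back the file written above; header = TRUE tells arules that the first line holds the column names
transactions = read.transactions('data.csv', format = "single", header = TRUE, sep = ",", cols = c("order_id", "product_name"))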
transactions
transactions in sparse format with
 39320 transactions (rows) and
 30896 items (columns)
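
To make the matrix representation concrete, here is a small base R sketch that turns long-format rows into a 0/1 incidence matrix. The toy data frame is hypothetical and only mimics the order_id and product_name columns. arules stores the same information in sparse form, which matters at 39320 by 30896 cells where almost every entry is 0.

```r
# Hypothetical long-format data shaped like the order_id / product_name columns.
toy <- data.frame(
  order_id     = c(1, 1, 2, 3, 3, 3),
  product_name = c("Banana", "Yogurt", "Banana", "Avocado", "Banana", "Yogurt")
)

# Cross-tabulate orders against products, then reduce counts to 0/1 so each
# cell only records whether the product is in the order.
counts    <- unclass(table(toy$order_id, toy$product_name))
incidence <- (counts > 0) * 1
incidence
```

Each row is one order and each column one product, which is the layout the transactions object describes as "transactions (rows)" and "items (columns)".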


Inspect a few transactions.

inspect(transactions[1:5])
    items                                               transactionID
[1] {Bag of Organic Bananas,                                         
     Bulgarian Yogurt,                                               
     Cucumber Kirby,                                                 
     Lightly Smoked Sardines in Olive Oil,                           
     Organic 4% Milk Fat Whole Milk Cottage Cheese,                  
     Organic Celery Hearts,                                          
     Organic Hass Avocado,                                           
     Organic Whole String Cheese}                             1      
[2] {Apple Sauce,                                                    
     Original Real Crumbled Bacon}                            1000162
[3] {ProteinPLUS Multigrain Penne Pasta,                             
     Sesame Topped Hamburger Buns}                            1000197
[4] {Curate Melon Pomelo Sparking Water,                             
     Milk, Vitamin D,                                                
     Organic Super Fruit Punch Juice Drink,                          
     Sausage  Links}                                          1000209
[5] {Coconut Milk Virgin Chocolate Bar,                              
     Disinfecting Bathroom Cleaner - Lemongrass Citrus,              
     Green Clary Sage & Citrus All-Purpose Cleaner,                  
     Ramen, Vegan, Miso,                                             
     The Ring Vegetable Brush}                                1000222


Plot of the 20 items that show up most frequently in transactions.

itemFrequencyPlot(transactions, type="absolute", top = 20, cex.names = 0.9, border = NA )


Frequent itemsets.

Fruits and vegetables are very popular grocery cart items.

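# target = "frequent itemsets" makes apriori() return itemsets instead of rules
basket_sets = apriori(transactions, parameter = list(supp = 0.002, minlen = 2, target = "frequent itemsets"), control = list(verbose = FALSE))
# sort the mined itemsets (basket_sets) by support and convert them to a data frame
basket_set_dataframe = DATAFRAME(sort(basket_sets, by = "support", decreasing = TRUE))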
basket_set_dataframe


Plot of frequent itemsets.

par(mar=c(3, 20, 0, 0) + .4)
barplot(basket_set_dataframe$support, names = basket_set_dataframe$items, las = 2, horiz = T, cex.names = 0.9, border = NA, cex.axis = .8 )


Rules Learning

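This is the heart of the analysis, where we call apriori to build a rules object by mining association rules from the frequent itemsets. A rule is kept only if it appears in at least 0.1% of transactions (support), holds at least 25% of the time (confidence), and contains at least two items (minlen).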

rules = apriori(transactions, parameter = list(support = 0.001, confidence=0.25, minlen = 2 ))
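
As a rough illustration of how rules come out of a frequent itemset, the base R sketch below tries each item of one itemset as the right-hand side and keeps the rules that pass a confidence threshold. The support values, the key() helper, and the 0.75 threshold are all made up for the example. Only single-item right-hand sides are tried, which matches the form of rules that apriori in arules produces.

```r
# Hypothetical supports for one frequent itemset and its subsets.
support <- c(
  "Bread"                     = 0.50,
  "Jelly"                     = 0.30,
  "Peanut Butter"             = 0.40,
  "Bread|Jelly"               = 0.20,
  "Bread|Peanut Butter"       = 0.30,
  "Jelly|Peanut Butter"       = 0.25,
  "Bread|Jelly|Peanut Butter" = 0.20
)

# Build the lookup name for an itemset (items sorted, joined with "|").
key <- function(items) paste(sort(items), collapse = "|")

itemset        <- c("Bread", "Jelly", "Peanut Butter")
min_confidence <- 0.75

# Try each item as the right-hand side of a rule.
candidate_rules <- do.call(rbind, lapply(itemset, function(rhs) {
  lhs  <- setdiff(itemset, rhs)
  conf <- support[[key(itemset)]] / support[[key(lhs)]]
  data.frame(lhs = paste(lhs, collapse = ", "), rhs = rhs,
             confidence = conf, lift = conf / support[[rhs]])
}))

# Keep only the rules that reach the confidence threshold.
kept_rules <- candidate_rules[candidate_rules$confidence >= min_confidence, ]
kept_rules
```

With these numbers, {Jelly, Peanut Butter} => {Bread} and {Bread, Jelly} => {Peanut Butter} pass, while {Bread, Peanut Butter} => {Jelly} has a confidence of about 0.67 and is dropped.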


Summary of Rules

summary(rules)
set of 522 rules

rule length distribution (lhs + rhs):sizes
  2   3   4 
155 353  14 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00    2.00    3.00    2.73    3.00    4.00 

summary of quality measures:
    support           confidence          lift       
 Min.   :0.001017   Min.   :0.2500   Min.   : 1.915  
 1st Qu.:0.001144   1st Qu.:0.2716   1st Qu.: 2.629  
 Median :0.001373   Median :0.3003   Median : 3.751  
 Mean   :0.001943   Mean   :0.3248   Mean   : 6.513  
 3rd Qu.:0.001850   3rd Qu.:0.3538   3rd Qu.: 5.145  
 Max.   :0.021109   Max.   :0.7019   Max.   :91.019  

mining info:
         data ntransactions support confidence
 transactions         39320   0.001       0.25
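
The relative support values in the summary are easier to read as order counts. The arithmetic below uses only the figures printed above; the variable names are mine.

```r
n_transactions <- 39320   # from the mining info above
min_support    <- 0.001   # threshold passed to apriori()

# Smallest number of orders a rule's items must share for its
# relative support to reach the threshold.
min_count <- ceiling(min_support * n_transactions)
min_count

# The same count expressed as relative support.
min_count / n_transactions

# Approximate number of orders behind the largest reported support.
max_count <- round(0.021109 * n_transactions)
max_count
```

So the weakest rule kept rests on 40 orders (40 / 39320 is about 0.001017, the minimum support in the summary), and the most common one on about 830 orders.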


Inspecting a few rules.

For example, the first rule below states that customers who bought {Organic Fuji Apples} are also likely to purchase {Bag of Organic Bananas}, with a lift of 3.6. Lift measures how much more likely the right-hand item is to be purchased when the left-hand item is in the cart, compared with how often it is purchased overall. A higher lift indicates a stronger association, although rules with very low support can show a high lift by chance. With the table below we can interactively sort by lift, support, and confidence.

inspectDT(rules)


Interactive Visualization

Sifting through rules can be time consuming. The interactive scatter plot below efficiently communicates relationships between lift, confidence and support.

# newer arulesViz releases deprecate plotly_arules() in favor of plot(rules, engine = "plotly")
plotly_arules(rules)


Graph plot

plot(rules[1:10], method="graph" )


source of image (https://www.instacart.com/)
R - Association Rules - Market Basket Analysis(https://www.youtube.com/watch?v=b5hgDPa7a2k)